### Load standard packages
library(tidyverse) # Collection of all the good stuff like dplyr, ggplot2 etc.
library(magrittr) # For extra piping operators (e.g., %<>%)

This session

In this applied session, you will:

  1. xxx
  2. xxx
  3. xxx

Refresher: Basics of String Manipulation

We start by taking a piece of text and turning it into something that carries the meaning of the initial text but is less noisy, and thus perhaps easier for a computer to “understand”.

text <- "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag. Yet the 29 year old says the performances come at a high price"
# Transforming to lower case
text %>% str_to_lower()
[1] "the eton-educated, non-binary british iraqi had always struggled with their identity, until they discovered drag. yet the 29 year old says the performances come at a high price"
# Split by '.' (=sentence)
text %>% str_split('\\.')
[[1]]
[1] "The Eton-educated, non-binary British Iraqi had always struggled with their identity, until they discovered drag"
[2] " Yet the 29 year old says the performances come at a high price"                                                 
# Replace every 'o' with 'O'
text %>% str_replace_all('o', 'O')
[1] "The EtOn-educated, nOn-binary British Iraqi had always struggled with their identity, until they discOvered drag. Yet the 29 year Old says the perfOrmances cOme at a high price"
# Remove punctuation, then split by ' ' (=word)
text %>% str_remove_all('[[:punct:]]') %>% str_split(' ') %>% unlist()
 [1] "The"          "Etoneducated" "nonbinary"    "British"      "Iraqi"        "had"          "always"       "struggled"    "with"        
[10] "their"        "identity"     "until"        "they"         "discovered"   "drag"         "Yet"          "the"          "29"          
[19] "year"         "old"          "says"         "the"          "performances" "come"         "at"           "a"            "high"        
[28] "price"       
# Same as above, but transformed to lower case first
text %>% str_to_lower() %>% str_remove_all('[[:punct:]]') %>% str_split(' ')
[[1]]
 [1] "the"          "etoneducated" "nonbinary"    "british"      "iraqi"        "had"          "always"       "struggled"    "with"        
[10] "their"        "identity"     "until"        "they"         "discovered"   "drag"         "yet"          "the"          "29"          
[19] "year"         "old"          "says"         "the"          "performances" "come"         "at"           "a"            "high"        
[28] "price"       
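
Note that removing punctuation outright fuses hyphenated words ("Etoneducated", "nonbinary"). A minimal sketch of one alternative, assuming we would rather split such words: replace punctuation with a space and collapse the resulting whitespace. The helper name `simple_tokenize` is hypothetical and not part of the session material.

```r
library(stringr)

# Hypothetical helper: lower-case the text, replace punctuation with a space,
# collapse repeated whitespace, and split into words.
simple_tokenize <- function(x) {
  x_lower <- str_to_lower(x)
  x_clean <- str_squish(str_replace_all(x_lower, "[[:punct:]]", " "))
  str_split(x_clean, "\\s+")[[1]]
}

tokens <- simple_tokenize("The Eton-educated, non-binary British Iraqi")
tokens
```

Whether hyphenated words should be split or kept together depends on the application; this only illustrates that the choice is made at the cleaning step.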

The R NLP ecosystem

  • Most language analysis approaches are based on the analysis of texts word-by-word.
  • Here, their order might matter (word sequence models) or not (bag-of-words models), but the smallest unit of analysis is usually the word.
  • This is usually done in the context of the document the word appeared in. Therefore, at first glance, three types of data structures make sense:
  1. Tidy: Approach where data is served in a 2-column document-word format (e.g., tidytext)
  2. Token lists: Creation of special objects, saved as document-token lists or a corpus (e.g., tm, quanteda)
  3. Matrix: Approach where data is served as a document-term matrix, term-frequency matrix, etc.
  • Different forms of analysis (and the packages used for them) favor different structures, so we need to be fluent in transferring the original raw text into these formats, as well as in switching between them (for more information, see https://www.tidytextmining.com/dtm.html).
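
To make the switch between structures concrete, here is a minimal sketch that goes from raw text to the tidy format and on to a sparse document-term matrix. The two-document corpus is made up for illustration, and `cast_sparse()` is just one of the casting functions in tidytext; it is used here under the assumption that a plain sparse matrix is sufficient.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Toy corpus (made up for illustration)
docs <- tibble(
  id   = c(1, 2),
  text = c("Text mining is fun", "Tidy text mining is tidy")
)

# Tidy: one row per document-word pair, with a count
docs_tidy <- docs %>%
  unnest_tokens(word, text) %>%
  count(id, word)

# Matrix: documents in rows, terms in columns, counts in cells
docs_matrix <- docs_tidy %>%
  cast_sparse(id, word, n)

dim(docs_matrix)
```

Going in the other direction, from a document-term matrix back to a tidy table, is covered by `tidy()` methods, as described in the linked chapter.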

Tidy Text Formats

  • While there are other ecosystems for text analysis (e.g., tm, quanteda), I will here almost exclusively use tidytext, which is simple yet powerful, very well documented, and works neatly with tidymodels and the rest of the tidyverse ecosystem.
library(tidytext)
  • While we will use different formats for later applications, we will here limit ourselves to word tokens, which can do most of the simple jobs.
  • Here, we apply tidy principles to text, making the word token per document our unit of analysis.
  • Therefore, every row represents a word per document. This sounds like a lot of redundancy, but it makes the data very easy to work with compared to more complex matrix and list formats. Here, we can do our usual summaries and visualizations pretty much out-of-the-box.
# tidytext expects a tibble as its starting point
text_tbl <- tibble(id = 1, text = text)
# We now unnest the tokens. Note that, by default, this removes all punctuation and converts the text to lower case.
text_tidy <- text_tbl %>% unnest_tokens(word, text, token = 'words')
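
As a small illustration of the "out-of-the-box" summaries mentioned above, the following sketch counts word frequencies with ordinary dplyr verbs. The example sentence is made up; only `unnest_tokens()` and `count()` are used.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Made-up example text, for illustration only
example_tbl <- tibble(id = 1, text = "The cat sat on the mat. The mat was flat.")

# One row per word, then a frequency table sorted by count
word_counts <- example_tbl %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

word_counts
```

Because the result is a regular tibble, it can be passed straight on to ggplot2 for plotting.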
  • Overall, in NLP we are trying to represent meaning structure.
  • That means that we want to focus on the most important and “meaning-bearing elements” in text, while reducing noise.
  • Words such as “and”, “have”, “the” may have central syntactic functions but are not particularly important from a semantic perspective.
# tidytext comes with a stop-word lexicon
stop_words
# Remove all tokens that appear in the stop-word lexicon
text_tidy %<>%
  anti_join(stop_words, by = 'word')
text_tidy
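
The `stop_words` table combines several lexicons, identified in its `lexicon` column. A minimal sketch, assuming we want a narrower list plus a domain-specific addition: keep only the "snowball" lexicon and append a custom word. The word "says" and the example sentence are arbitrary choices for illustration.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Keep one lexicon only and add a custom (made-up) stop word
my_stop_words <- stop_words %>%
  filter(lexicon == "snowball") %>%
  bind_rows(tibble(word = "says", lexicon = "custom"))

# Tokenize a made-up sentence and drop everything in the custom list
tokens <- tibble(id = 1, text = "The model says the results are good") %>%
  unnest_tokens(word, text) %>%
  anti_join(my_stop_words, by = "word")

tokens
```

Which lexicon is appropriate depends on the corpus; a very broad list can remove words that matter for the analysis.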
# We can also unnest by sentence instead of by word. The text is still converted to lower case by default.
sentences_tidy <- text_tbl %>% unnest_tokens(sentence, text, token = 'sentences')
sentences_tidy

Your turn!

Take the following text and transform it into a list of lists, with each element being a tokenized sentence. Remove stopwords, lower-case all tokens, and keep only alphanumeric tokens.

I’ve been called many things in my life, but never an optimist. That was fine by me. I believed pessimists lived in a constant state of pleasant surprise: if you always expected the worst, things generally turned out better than you imagined. The only real problem with pessimism, I figured, was that too much of it could accidentally turn you into an optimist.

source: https://www.theguardian.com/global/2019/nov/21/glass-half-full-how-i-learned-to-be-an-optimist-in-a-week
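
As a hint, here is one possible shape of such a pipeline, sketched on a different, made-up text rather than the exercise text. It assumes that tokenizing in two passes (sentences first, then words) is acceptable; the column name `sentence_id` is a hypothetical choice.

```r
library(dplyr)
library(stringr)
library(tibble)
library(tidytext)

# Made-up text, not the exercise text
example <- tibble(text = "Pessimists expect rain. Optimists carry umbrellas anyway.")

sentence_tokens <- example %>%
  unnest_tokens(sentence, text, token = "sentences") %>%  # one row per sentence
  mutate(sentence_id = row_number()) %>%                  # remember the sentence
  unnest_tokens(word, sentence) %>%                       # one row per word, lower-cased
  anti_join(stop_words, by = "word") %>%                  # drop stop words
  filter(str_detect(word, "^[[:alnum:]]+$"))              # keep alphanumeric tokens only

# One character vector of tokens per sentence
token_list <- unname(split(sentence_tokens$word, sentence_tokens$sentence_id))
token_list
```

Note that a sentence consisting only of stop words would disappear from the resulting list with this approach.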

data <- read_csv('https://sds-aau.github.io/SDS-master/00_data/xxx.csv') 

Endnotes

Your turn

Please do Exercise 1 in the corresponding section on GitHub.

Packages & Ecosystem

Suggestions for further study

  • DataCamp (note: all courses have somewhat outdated ecosystems)

Session Info

sessionInfo()